GH-3596: Add RowRanges.Builder for incremental construction from selected row indices#3597
GH-3596: Add RowRanges.Builder for incremental construction from selected row indices#3597peter-toth wants to merge 7 commits into
RowRanges.Builder for incremental construction from selected row indices#3597Conversation
…m selected row indices ### Rationale for this change Opening up APIs needed by a later materialization feature in Spark. External readers need to assemble a RowRanges incrementally from a stream of selected row indices (e.g. produced by a downstream filter or join) without having to know page boundaries ahead of time. ### What changes are included in this PR? Adds a Builder to RowRanges that takes a strictly-increasing sequence of selected row indices via addSelected(long) and coalesces consecutive indices into Range entries. Out-of-order or duplicate calls throw IllegalArgumentException. ### Are these changes tested? Yes. TestRowRanges covers single/multiple/coalesced ranges, the empty builder case, and the out-of-order/duplicate rejection paths. ### Are there any user-facing changes? No. Closes apache#3596 Co-authored-by: Matt Butrovich <mbutrovich@gmail.com>
e11f3d9 to
58857df
Compare
|
@wgtmac, could you please take a look at this PR? |
| * @return the constructed {@link RowRanges}, or {@link RowRanges#EMPTY} when no rows were | ||
| * selected. | ||
| */ | ||
| public RowRanges build() { |
There was a problem hiding this comment.
build() should return a snapshot or become terminal. As written, the returned RowRanges shares the builder’s mutable ranges list, so later calls on the same builder can mutate a previously built result and even break the sorted-ranges invariant.
| * @param blockRow the row index to mark selected (must be {@code >} the last value passed) | ||
| * @return this builder for chaining | ||
| */ | ||
| public Builder addSelected(long blockRow) { |
There was a problem hiding this comment.
addSelected should reject negative row indexes explicitly. -1 also collides with the builder’s sentinel state, so invalid input can be silently swallowed or corrupt the active run.
| @@ -316,4 +316,72 @@ public List<Range> getRanges() { | |||
| public String toString() { | |||
| return ranges.toString(); | |||
| } | |||
|
|
|||
| /** | |||
There was a problem hiding this comment.
This needs a clearer API commitment. The motivation is external readers/Spark, but the type lives under internal and the PR says there are no user-facing changes. If this is meant to be supported externally, the package/API contract should say so; otherwise downstream users may depend on an unsupported API.
There was a problem hiding this comment.
Yeah, at the time, it was yet another approach to tighten the amount of publicly available API in parquet-java. It quickly turned out that 3rd parties want to use it. For the next major release it is a good candidate to be moved to a more public package. Maybe deprecate it here, and also create at another location?
There was a problem hiding this comment.
@wgtmac, @gszadovszky, I wonder which package could be the right place for RowRanges and the new builder? Shall I try to move/deprecate them in this PR or separate one?
There was a problem hiding this comment.
A proper target package would be the same as now just skip the internal part. I am fine moving/deprecating in this PR. Theoretically, we can simply move it without deprecation since it is "internal" currently, so we have no guarantees for backward compatibility. WDYT, @wgtmac?
There was a problem hiding this comment.
Oh, I haven't noticed that, sorry. Then, we probably need to get back to the deprecated/move pattern. 😞
Let's move to the new place, extend the class as required at the old place and deprecate. Also deprecate the public API that references the old place and create new methods for the new one. Mark 2.0 as the removal version. WDYT?
There was a problem hiding this comment.
I looked at the released (1.17.0) surface: the only genuinely-public, cross-package method exposing RowRanges is ParquetFileReader.readFilteredRowGroup(int, RowRanges). ParquetFileReader.getRowRanges(int), that returns RowRanges was only made public on master (1.18.0).
Everything else is under org.apache.parquet.internal.*.
Is it ok if we preserve compatibility for just readFilteredRowGroup(int, RowRanges) and treat the internal.* relocations as internal changes with no compat guarantee?
There was a problem hiding this comment.
Is it ok if we preserve compatibility for just readFilteredRowGroup(int, RowRanges) and treat the internal.* relocations as internal changes with no compat guarantee?
It really depends on whether this relocation will break any existing users. As you've said that readFilteredRowGroup has been made public since 1.17.0 so we have to add a shim layer for the internal RowRanges anyway?
There was a problem hiding this comment.
Yes we do. Move and deprecation is done in fa56ff2.
Please note that org.apache.parquet.internal.* excludes still remained.
| * RowRanges ranges = builder.build(); | ||
| * }</pre> | ||
| */ | ||
| public static class Builder { |
There was a problem hiding this comment.
Consider making Builder final and adding an explicit constructor. Otherwise subclassing and new RowRanges.Builder() become part of the public API surface by accident.
| * @param blockRow the row index to mark selected (must be {@code >} the last value passed) | ||
| * @return this builder for chaining | ||
| */ | ||
| public Builder addSelected(long blockRow) { |
There was a problem hiding this comment.
The naming could be clearer before this API ships. addSelected(long blockRow) is ambiguous; something like addSelectedRow(long rowIndex) would better communicate that the value is a 0-based row index within the current row group.
|
Do you have any advice on this new API? @gszadovszky |
…dRowGroup compatibility Recreate org.apache.parquet.internal.filter2.columnindex.RowRanges as a deprecated subclass of the relocated org.apache.parquet.filter2.columnindex.RowRanges so the released ParquetFileReader#readFilteredRowGroup(int, RowRanges) signature keeps linking. Add a deprecated readFilteredRowGroup overload taking the old internal type that delegates to the new one, and drop the japicmp exclusions for that method now that it is compatible. Other internal-only surfaces (RowRanges statics, Range, ColumnIndexFilter return types) remain relocated without a bridge.
Rationale for this change
This PR is based on @mbutrovich's previous work.
Opening up APIs needed by a later materialization feature in Spark. External readers need to assemble a
RowRangesincrementally from a stream of selected row indices (e.g. produced by a downstream filter or join) without having to know page boundaries ahead of time.Since
RowRangesis the type external readers must work with, this PR also promotes it out of theorg.apache.parquet.internal.*package into a supported location.What changes are included in this PR?
RowRanges.Builderthat takes a strictly-increasing sequence of selected row indices viaaddSelectedRow(long rowIndex)and coalesces consecutive indices intoRangeentries. Out-of-order, duplicate, or negative calls throwIllegalArgumentException.build()returns an independent snapshot, so the builder can be reused afterwards.Builderisfinalwith a private constructor.RowRanges(withRangeandBuilder) fromorg.apache.parquet.internal.filter2.columnindextoorg.apache.parquet.filter2.columnindex.org.apache.parquet.internal.filter2.columnindex.RowRangessubclass is kept, andParquetFileReader.readFilteredRowGroup(int, RowRanges)gains a deprecated overload (taking the old internal type) that delegates to the new one. Both are marked for removal in 2.0.internal-only surfaces (RowRangesstatics,Range,ColumnIndexFilterreturn types) are relocated without a bridge, as theinternalpackage carries no compatibility guarantee.Are these changes tested?
Yes.
TestRowRangescovers single/multiple/coalesced ranges, the empty builder case, the out-of-order/duplicate/negative rejection paths, and thatbuild()returns a snapshot independent of the builder. ExistingTestColumnIndexFilterandTestParquetFileReaderRowRangescontinue to pass against the relocated type.Are there any user-facing changes?
Yes.
RowRanges(andRange/Builder) is now available at the supported packageorg.apache.parquet.filter2.columnindex. The oldorg.apache.parquet.internal.filter2.columnindex.RowRangesandParquetFileReader.readFilteredRowGroup(int, <internal> RowRanges)remain as deprecated shims and will be removed in 2.0.Closes #3596